
[Content Understanding] Update toLlmInput page markers and filter LLMStats telemetry #38851

Draft
chienyuanchang wants to merge 2 commits into main from cu-sdk/llm-input-helper-update


Conversation

@chienyuanchang
Member

Packages impacted by this PR

  • @azure/ai-content-understanding

Issues associated with this PR

Describe the problem that is addressed by this PR

The toLlmInput() helper renders Content Understanding AnalysisResult objects into LLM-friendly text. Two output-hygiene issues need to be addressed before the next CU service release:

  1. The SDK emits page boundary markers as <!-- page N -->. The upcoming service release (per ContentUnderstanding-Docs#249) will emit the same boundary using <!-- InputPageNumber: N -->. The SDK should adopt the new format and avoid emitting duplicate markers when the service-supplied markdown already contains them.
  2. The service occasionally surfaces internal telemetry strings (e.g. LLMStats: completion calls: 2; embedding calls: 1; completion latency: 7.71s) in the warnings collection. These are not Responsible-AI warnings, and downstream consumers (Agent Framework, LangChain) currently strip them with local regex workarounds. The SDK should filter them at the source so the noise never reaches the LLM-facing rai_warnings block.

What are the possible designs available to address the problem? If there are more than one possible design, why was the one in this PR chosen?

This PR makes the smallest possible surface change inside toLlmInput():

  • Page marker constant + guard. Add an INPUT_PAGE_MARKER_PREFIX constant and a hasInputPageMarker() check at the top of addPageMarkers(). If the markdown already includes any <!-- InputPageNumber: substring (case-sensitive), pass the markdown through unchanged. Otherwise inject the new-format marker via the existing spans / PageBreak paths.
  • Telemetry filter on warnings. Add a TELEMETRY_MESSAGE_PREFIXES = ["LLMStats:"] list and a small isTelemetryMessage() predicate. Inside formatWarnings(), skip entries whose message (after trimming leading whitespace) starts with any prefix. Filtering is scoped to the structured warnings list only; the document markdown body is never inspected, so legitimate LLMStats: text in documents is preserved.
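The page-marker guard described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names INPUT_PAGE_MARKER_PREFIX, hasInputPageMarker() and addPageMarkers() come from the description, but the function bodies, the formatPageMarker() helper, and the `<!-- PageBreak -->` delimiter are assumptions. The real helper also locates page boundaries via spans, which is omitted here.

```typescript
// Hypothetical sketch: names follow the PR description, bodies are assumed.
const INPUT_PAGE_MARKER_PREFIX = "<!-- InputPageNumber:";

// Assumed page delimiter in service markdown; the real helper also uses spans.
const PAGE_BREAK = "<!-- PageBreak -->";

/** True when the markdown already carries at least one new-format marker. */
function hasInputPageMarker(markdown: string): boolean {
  // String.prototype.includes is case-sensitive, as the description requires.
  return markdown.includes(INPUT_PAGE_MARKER_PREFIX);
}

/** Illustrative helper (not named in the PR): builds one marker comment. */
function formatPageMarker(pageNumber: number): string {
  return `${INPUT_PAGE_MARKER_PREFIX} ${pageNumber} -->`;
}

/** Simplified stand-in for addPageMarkers(): guard first, then inject. */
function addPageMarkers(markdown: string): string {
  if (hasInputPageMarker(markdown)) {
    // Service-supplied markers are present: pass the markdown through unchanged.
    return markdown;
  }
  return markdown
    .split(PAGE_BREAK)
    .map((page, index) => `${formatPageMarker(index + 1)}\n${page.trim()}`)
    .join("\n\n");
}
```

Because the guard runs before any injection, calling this sketch on its own output returns that output unchanged, which is the duplicate-suppression property the PR is after.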

Alternative considered: post-rendering regex on the YAML output (the workaround currently used by Agent Framework). Rejected because operating on the structured list before rendering is simpler, more robust to YAML escaping, and idempotent.
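For comparison, filtering the structured list before rendering might look roughly like the sketch below. TELEMETRY_MESSAGE_PREFIXES, isTelemetryMessage() and formatWarnings() are named in the description; the WarningLike shape, the return type, and the exact layout of the rendered rai_warnings block are assumptions made for illustration only.

```typescript
// Hypothetical warning shape; the SDK's actual type may carry more fields.
interface WarningLike {
  code?: string;
  message?: string;
}

const TELEMETRY_MESSAGE_PREFIXES: readonly string[] = ["LLMStats:"];

/** True when the message, ignoring leading whitespace, starts with a telemetry prefix. */
function isTelemetryMessage(message: string | undefined): boolean {
  if (message === undefined) {
    return false;
  }
  const trimmed = message.trimStart();
  // startsWith is case-sensitive, so "llmstats:" is not treated as telemetry.
  return TELEMETRY_MESSAGE_PREFIXES.some((prefix) => trimmed.startsWith(prefix));
}

/**
 * Simplified stand-in for formatWarnings(): returns the lines of an
 * rai_warnings block, or an empty array when no real warnings remain.
 * The block layout here is assumed, not taken from the SDK.
 */
function formatWarnings(warnings: readonly WarningLike[]): string[] {
  const kept = warnings.filter((w) => !isTelemetryMessage(w.message));
  if (kept.length === 0) {
    return []; // omit the block entirely
  }
  const lines = ["rai_warnings:"];
  for (const w of kept) {
    // A JSON string literal is also a valid YAML double-quoted scalar.
    lines.push(`  - code: ${JSON.stringify(w.code ?? "")}`);
    lines.push(`    message: ${JSON.stringify(w.message ?? "")}`);
  }
  return lines;
}
```

Since the predicate sees the raw message rather than escaped YAML text, no escaping rules need to be mirrored in a regex, and filtering an already-filtered list leaves it unchanged.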

Are there test cases added in this PR? (If not, why?)

Yes. Updated existing tests for the new marker format and added six new unit tests:

  • Duplicate-marker suppression when service markdown already contains markers.
  • LLMStats: warnings dropped while real warnings are kept.
  • rai_warnings block omitted entirely when only LLMStats: warnings exist.
  • Case-sensitive filter (lowercase llmstats: is preserved).
  • Markdown body containing literal LLMStats: text is preserved verbatim.
  • Leading-whitespace LLMStats: warnings are filtered.

All 37 unit tests in test/public/node/llmInputHelper.spec.ts pass locally.
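As an illustration of the filter-related cases listed above, a table-driven check could be structured like this. It is a sketch only: the isTelemetry predicate is a stand-in with an assumed body, and the actual tests in llmInputHelper.spec.ts exercise toLlmInput() through the test framework rather than a bare predicate.

```typescript
// Stand-in predicate mirroring the behavior described in this PR (assumed body).
const isTelemetry = (message: string): boolean =>
  message.trimStart().startsWith("LLMStats:");

interface FilterCase {
  name: string;
  message: string;
  filtered: boolean; // expected outcome
}

const filterCases: FilterCase[] = [
  { name: "telemetry is dropped", message: "LLMStats: completion calls: 2", filtered: true },
  { name: "leading whitespace is ignored", message: "   LLMStats: embedding calls: 1", filtered: true },
  { name: "lowercase prefix is preserved", message: "llmstats: completion calls: 2", filtered: false },
];

// Collect the names of any cases whose actual outcome differs from the expectation.
const failedCases: string[] = filterCases
  .filter((c) => isTelemetry(c.message) !== c.filtered)
  .map((c) => c.name);
```

An empty failedCases array means every listed expectation holds for the stand-in predicate.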

Provide a list of related PRs (if any)

Companion PRs in sibling SDKs:

Command used to generate this PR (applicable only to SDK release request PRs)

Not applicable. This PR modifies hand-authored helper code; no regeneration was performed.

Checklists

  • Added impacted package name to the issue description
  • Does this PR need any fixes in the SDK Generator? — No. Helper lives in src/static-helpers/llmInputHelper.ts (not generated).
  • Added a changelog (if necessary) — CHANGELOG.md updated under 1.2.0-beta.2 (Unreleased).

